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Abstract 



We propose a novel model for nonlinear dimension reduction motivated by the probabilistic formu- 
lation of principal component analysis. Nonlinearity is achieved by specifying different transformation 
matrices at different locations of the latent space and smoothing the transformation using a Markov 
random field type prior. The computation is made feasible by the recent advances in sampling from von 
Mises-Fisher distributions. 



Index Terms 

Dimensionality reduction, Gibbs sampling, Markov random field, Principal component analysis. 



I. Introduction 

Principal component analysis (PCA) is an old statistical technique for unsupervised dimension 
reduction. It is often used for exploratory data analysis with the objective of understanding 
the structure of the data. PCA aims to represent the high dimensional data points with low- 
dimensional representers commonly called latent variables, which can be used for visualization, 
data compression etc. Sometimes PCA is also used as a preprocessing step before regression [1] 
or clustering [2]. In these context, however, PCA typically does not have satisfying performance 
due to the ignorance of subsequent analysis. 

We denote the original high dimensional data by Y = {yx, y 2 , . . . y n } 7 \ where = {yn, . . . , yi P } T £ 
R p . Note that the superscript T is used to denote transposition so that y^ is a column vector. 
We assume the data are already centered so that y = Ym=i Vil n = 0- One common definition of 

PCA is that of taking a linear combination of the components of yf. 

p 

Xi = ^UijVfri = 1,2, . . . ,n 

3=1 

where Vj is the weighting coefficient of the j-th covariate. This can be written as 

Xi = v T yi (1) 

where v = (vi, . . . ,v p ) T . We take \\v\ \ = 1 so that (OQ) represents a projection onto the linear 
subspace spanned by v. Given v and x^, the optimal linear reconstruction of y^ is given by 
yi = vxi. We want to be a good representation of the original y^ Thus we aim to minimize 
Si \ \Vi ~ Vi\\ 2 = S \ \Vi ~ vx i\\ 2 - It can b e shown that the minimizing v is the eigenvector of 
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Y T Y/n = J2yfyi/ n associated with its largest eigenvalue, called the first principal component 
and denoted by v\. Similarly, we can define did < p) principal components vi, . . . ,v d as the 
minimizer with respect to V of the total squared reconstruction error — Vi\\ 2 , where 

jji = Vx u V = (v 1: v 2 ,..., v d ), V T V = I dxd , and x t = V T yi e R d is the projection of y t onto 
the subspace spanned by the columns of V, the principal components. 

PCA is a linear procedure since the reconstruction is based on a linear combination of the 
principal components. Several nonlinear extensions have been proposed. The most famous one in 
the statistical literature is the principal curves proposed in [3]. The principal curve is defined as 
the curve such that each point on the curve is the center of all the data points whose projection 
onto the curve is that point. Thus visually the principal curve is defined as the curve that passes 
through the "middle" of the data points. Although conceptually appealing, the computational 
constraint makes it difficult to extend this approach to higher dimensions. Other approaches 
including neural networks [4], kernel embedding [5], and generative topographic mapping [6] 
have been proposed. 

The absence of probabilistic models in traditional PCA motivated the probabilistic PCA 
(PPCA) approach adopted by [7]. The advantage of probabilistic modeling is multifold, including 
providing a mechanism for density modeling, determination of degree of novelty of a new data 
point, and naturally incorporating incomplete observations. In [7], the generative model is defined 
through the observation equation: 



which stated the linear relationship between the latent variable and the data points, W is a p x d 
matrix that is not constrained to have orthogonal columns a priori, and are i.i.d. noises with 
€i ~ N(0, a 2 I pxp ). Note we assume that the data is already centered, otherwise the observation 
model should be changed to 



with shift parameter /x. In PPCA, we put a zero mean, unit covariance Gaussian prior on Xi, and 
the likelihood is maximized over (W, a 2 ) after marginalizing over a;,: 



Hi = WXi + €i 



(2) 



Hi = WXi + fX + €i 




February 9, 2008 



DRAFT 



4 



It is shown that when the noise level a goes to zero, the maximum likelihood estimator for W 
will converge to 

W = VD, (3) 

where the matrix V and D comes from singular value decomposition of Y j \fn = UDV T . Thus 
PPCA is a natural extension of the traditional PCA. 

[8] extends PPCA to mixture PPCA which can be used to model nonlinear structure in the 
data. In PPCA, after marginalizing over Xi, the distribution of yi becomes N(/j,, WW T + a 2 I) 
if the data are not centered. The mixture PPCA models the marginal distribution of jji as 

M 
m=l 

a mixture with M components, and for each component, the observation model is 

Vi = W m Xi + fJ, m + ei 

if the 2-th observation comes from the m-th mixture component. Thus in mixture PPCA, each 
mixture component is defined by a different linear transformation, while clustering is defined on 
the original p— dimensional space. Marginalization over x« is still the same using unit covariance 
Gaussian distribution. The maximization over {W 7 ™} and {/i m } can be performed using EM 
algorithm taking the mixture indicators as the missing data. The experiments in [8] showed that 
this model has a wide applicability. We also note that when using to reconstruct the data point 
Ui, we must also store the mixture component which is responsible for generating X{, or, more 
preferably, the posterior responsibility of each mixture for the i— th observation. This piece of 
information cannot be recovered from the latent variable Xi alone. 

Another approach of probabilistic nonlinear PCA based on Gaussian processes has been 
proposed in [9]. It starts from the same observation model ©, but instead of marginalizing over 
x^ it marginalizes over W by putting independent spherical Gaussian prior on the d columns 
of W, resulting in the marginal distribution of y.j ~ iV(0,XX T + cr 2 I), where y.j is the j-th 
column of Y and X is the n x d matrix of latent variables. The author noticed that one can 
replace XX T + a 2 1 with another kernel matrix to achieve nonlinearity. Conceptually, this can be 
regarded as multivariate nonparametric regression problem = f(xi) + €i with Xi unknown, and 
need to be found by optimization of the likelihood. The computational complexity of Gaussian 
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process approach is cubic in the number of data points n, although approximation algorithm can 
be designed to reduce the complexity. 

In this contribution, we propose a novel Bayesian approach to nonlinear PCA which puts 
priors on both x and V. The model is based on an observation model similar to (O, but with 
two differences. First, the linear transformation is defined through the orthonormal matrix V 
instead of W which roughly corresponds to V D in PPCA. Second, the linear transformation 
V in our model is dependent on the corresponding latent variable. The linear transformations 
in different parts of the latent space are related by putting a Markov random field prior over 
the space of orthonormal matrices which makes the model identifiable. The model is estimated 
by Gibbs sampling which explores the posterior distribution of both the latent space and the 
transformation space. The computational burden for each iteration of Gibbs sampling is square 
in the number of data points. 

The rest of the paper is organized as follows: In the next section, we present the Baysian model 
and discuss the Gibbs sampling estimation procedure. Since we think the readers might not be 
familiar with the von Mises-Fisher distribution, some background material is also provided. Some 
experiments are carried out in section 3 using both simulated manifold data and the handwritten 
digits data. We conclude in section 4 with some thoughts on possible extensions of the model. 

II. Bayesian Nonlinear PCA 

A. Stiefel Manifold and von Mises-Fisher Distribution 

Orthonormal matrices play a key role in our Bayesian model. By definition, the set of n x d 
matrices X with X T X = I nxn is called the Stiefel manifold and denoted by v n ^. This is a 
compact manifold. The most common probability distribution on the Stiefel manifold is the von 
Mises-Fisher distribution with a density with respect to the uniform distribution on the Stiefel 
manifold, which has an exponential family form: 

p(X\C) oc exp{tr(C T X)} 

where C is a matrix of the same dimension as X and the normalizing constant is omitted above. 
This distribution is denoted by vMF{C). Note vMF(0) is just the uniform distribution on the 
Stiefel manifold. 
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Suppose the singular value decomposition of C is C = UDV T , with U and V being nxd and 
d x d orthonormal matrices, and D a diagonal matrix containing the singular values of C. The 
density p(X\C) is maximized at X = UV T which gives the "most likely" matrix from the Stiefel 
manifold under this distribution. The diagonal matrix D can be regarded as the concentration 
parameter of the distribution which determines the closeness of samples to the mode. Larger 
entries in D makes the distribution more peaked around the mode UV T . 

Sampling from von Mises-Fisher distribution has been studied in detail in [10]. Two efficient 
algorithms are proposed. One is the rejection sampling approach. The simplest proposal dis- 
tribution for rejection sampling is the uniform distribution on the Stiefel manifold. Sampling 
randomly from i/ nd can be done as follows [11]: 

• Sample U\ uniformly from the unit sphere S n -i, and set V\ = U\. 

• Sample u 2 uniformly from the unit sphere S n -2 and set v 2 = N\U2 where N\ is an 
orthonormal matrix whose columns spanned the subspace orthogonal to V\. 

• Sample Ud uniformly from the unit sphere S n -d and set = NdUd where Nd is an 
orthonormal matrix whose columns span the subspace orthogonal to vi, . . . ,Vd- 

In [10], more efficient rejection sampling is presented using a better proposal distribution. Yet 
another approach in [10] is to use iterative Gibbs sampling on each column of X based on the 
full conditional density. In our implementation, we use the rejection sampling approach, the R 
code of which is available from the website of the author of [10]. In [11], von Mises-Fisher 
distribution aided with Gibbs sampling is used to build a Bayesian model for PCA. Our model 
can also be regarded as a nonlinear extension of that work. 

B. Nonlinear PCA model with MRF 

The observation model of our Bayesian approach is similar to @ but with the additional 
flexibility that the linear transformation is dependent on the latent variable: 

V% = V Xi Xi + €i (4) 

V Xi , i = 1, 2, ... n are constrained to be orthonormal and depends on the latent variable Xj. This 
is one difference with previous approaches in [7], [8], [9], where the transformation matrix W 
roughly corresponds to principle directions properly scaled by the singular values of the data 
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matrix, see ([3]). The prior on the noise is the same as before: ~ N(0, a 2 I nxn ). We use a 
conjugate prior Gamma(r],7]r 2 /2) on the precision parameter 1/cr 2 so that the expectation of 
1/cr 2 is 1/r 2 . The prior on xi is an isotropic Gaussian Xi N(0, a 2 Id X d)- Note we don't 
necessarily have a = 1 here. The reason is that after putting the orthonormal constraint on V x , 
the scale information of the data is shifted to the latent variable x. In our implementation, we set 
a 2 to be the sample variance of each covariate of the data points, and averaged over p covariates. 
We find the result is insensitive to the choice of a as long as a is large enough. It is also as 
good to use the (improper) uniform prior for Xj. 

An important task is the specification of the prior for V Xi ,i = 1,2, . . .n. Independent prior 
obviously will not work here since the parameter V x typically has more degrees of freedom 
than can be estimated by the single constraining equation ©. Therefore, we seek a prior that 
takes into account the correlation of transformation matrices for all i simultaneously. A natural 
correlation between those orthonormal matrices can be introduced by the assumption that the 
transformation evolves slowly over the latent space. That is, the closeness of x^ and Xj for a 
pair (xi,Xj) as measured by the Euclidean distance in the latent space implies the closeness of 
V x . and V Xj on the Stiefel manifold. 

Markov Random Field (MRF) is particularly useful for studying spatial models where the 
strength of interaction between random variables depends on the closeness of the corresponding 
sites. It has been widely used in image analysis and computer vision (e.g. [12], [13]). Formally, 
let 5* be a finite index set representing the sites, with a random variable Z s associated with 
each site s 6 S and taking values in a subset of a Hilbert space with inner product (-,•). A 
neighborhood system is defined on the sites so that the full conditional probability of Z s only 
depends on its neighbors. One can think of the neighborhood system as an undirected graph 
where each vertex represents one site and two sites are neighbors of each other if and only if 
there is an edge connecting the two vertices. Although generally the conditional probabilities 
uniquely determines the joint distribution, the existence of the joint distribution is difficult to 
ascertain from the conditional ones. Thus it is generally more convenient to start by defining 
the joint distribution of the random variables. 

One simple example of MRF is defined by the joint distribution of all random variables: 

p({Z s }) oc exp\ S ^ j X st (Z s Z t )} 
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where the sum is over all pairs (s, t) that are neighbors of each other. This distribution represent 
the pairwise interactions of random variables between neighbors. In our case, the sites are 
represented by the position of the latent variables Xi in the latent space R d . At each site, we 
attach a random variable V Xi taking values on the Stiefel manifold. The MRF prior for the n 
orthonormal matrices V Xi ,i — 1, . . . , n is defined by the joint density with respect to the uniform 
measure: 

KKJIM) oc exp{tr{Y J \ j Vj i V X3 )} 

where the sum is over all pairs of data points, i.e., the neighborhood system is defined by 
the complete graph that puts an edge between all pairs of sites. For ease of notation, this 
joint distribution is denoted by MRF(Xij). The scalar \j represents the strength of interaction 
between sites i and j and its choice is discussed later. Thus in our prior, the full conditional 
probability p(V Xi \V Xj ,j ^ i) (omitting the conditioning on for simplicity) cannot be further 
reduced. The interaction between variables in this model is still additive in a pairwise manner 
though. 

The above probability density is well defined since the Stiefel manifold is compact and the 
normalizing constant can be found at least in theory. The conditional probability is trivially 

p(V Xi \V Xj ,j + i) oc exp{tr((J2 ^^f^)} 

which is a von Mises-Fisher density with parameter C = £\ j^i\jV Xj . 

As discussed previously, the mode of the conditional distribution p(V Xi \V Xj ,j ^ i) can be 
found from the singular value decomposition of the matrix £\ \jV Xj . The decomposition is 
difficult to find in closed form, but some approximation can give some insight into the prior. 
Suppose that Ay is large when Xi and Xj are close and negligible when they are distant from 
each other. Besides, if for those Xj close to x h the corresponding matrices V Xj are also close 
to each other and approximated by a common orthonormal matrix V, then ^\ . , 4 \jV Xj can be 
approximated by (^j^K^V. The mode of the distribution is approximately V and Ylj^i^ij 
determines the concentration of the distribution. So the effect of the MRF prior is to smooth the 
transformation matrices V Xi so that sites close by in the latent space are associated with similar 
transformations. 

By the above discussion, we would like to specify A^- as a decreasing function of the Euclidean 
distance between xi and 1 1. In this work, we make use of a Gaussian kernel function 
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for this purpose: 

Xij = af>(\\xi - xjW/w) 

where <f)(x) = exp{—x 2 /2}. The kernel width w determines the relative influence of different 
sites and the parameter c is related to the concentration of the conditional distribution and thus 
affects the "smoothness" of the joint distribution of {V Xi }. 

Summarizing, we use the following model for nonlinear dimension reduction: 

Hi = V Xi Xi + €i 

{V Xi }\{xi} ~ MRF({X ij }),X ij = af>(\\x i -x j \\/w) 

Xi ~ N(0,a 2 I) 

€i\a 2 ~ N(0,a 2 I) 

— ~ Garama{r],r]T 2 /2) 

We choose a 2 to be a large number or even infinity. Similar to [11], we set the "prior sample 
size" 77 = 2, and r 2 is derived from a pilot dimension reduction study such as the traditional 
PCA. For example, we can use r 2 = % I \v% — V%\ \ 2 / n P, where yi is the reconstructed data point 
from d— dimensional PCA. The choice of c and w is more difficult. For full Bayesian analysis, 
we should put a prior on c and w also. But this will cause computational difficulty with Gibbs 
sampling. In our experience, the choice w = Yl,i<j I \ x i ~~ x j 1 1/ (2) ant ^ c = 100/ n generally gives 
satisfactory results. 

C. Posterior Computation 

We propose using Gibbs sampling for posterior computation. The full conditional distribution 

of V Xi is 

p(V Xi \V Xj ,j^i,{x k } n k=1 ,<j 2 ,Y) 
oc ex P {- ^ ~ V ^iy ~ } ■ ex P {tr(£ KjVj.V Xj )} 

oc exp{tr(VT[ yi xl /a 2 + ^ \jV Xj ]} 

The expressions for other full conditional distributions are standard and their derivations omitted. 
The Gibbs sampling then iterates between the following steps. 
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. update V x ., for i = 1, ... , n, by sampling from vMF(C) with C = Vixf/a 2 + Yl&i ^ij V x r 

• update the latent variables Xi, for z = 1, . . . , n, by sampling from 

. r\ . Ob rp (X CJ . 

Xi\V x .,y h a ~ N{— — -V. ' y h ). 

cr + a 1 1 a z + o l 

• update the parameter a 2 by sampling 1/cr 2 from Gamma((r] + np)/2, (i]T 2 + J2i \ \Vi ~ 

The Gibbs sampling algorithm is initialized using standard PCA, setting the parameters and 
variable to the corresponding variables obtained from singular value decomposition of the data 
matrix. For statistical inferences of the parameters, the most convenient approach is to use the 
posterior sample average after the "burn in" period. 

III. Experimental Results 

In this section, we perform some limited experiments to illustrate our nonlinear Bayesian 
model for dimension reduction. 

To demonstrate the nonlinearity of the model, we sample 100 points on the unit sphere with 
noise level a = 0.05. The data is shown on Fig [TJ The Bayesian model is fitted with latent 
space dimension d = 2. The reconstructed data points from the latent space representation is 
also shown on Fig. [Q We can compare the histograms of the distances of data points to the 
surface. We also show the histogram of the reconstruction errors to illustrate the accuracy of the 
reconstruction. 

One popular dataset for visualization of dimension reduction is handwritten digits. We therefore 
use the MNIST database of handwritten digits and apply the model to a subset of 150 digits 
1, 2, 3 (50 of each). The image dimension of this dataset is 28 x 28. To reduce computational 
complexity, we subsampled the images so that the dimension is reduced to 14 x 14. The position 
of each image in the latent space is shown in Fig. [2l together with that obtained by traditional 
PCA. An objective assessment can be obtained by counting the number of images whose nearest 
neighbor in the latent space represents a different digit. For traditional PCA, we have 53 such 
images, while we only have 25 such images in our new model. 

IV. Conclusion 

We have presented a novel Bayesian framework for performing nonlinear principal component 
analysis. Each data point is associated with a different transformation and the different transfor- 
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(c) (d) (e) 

Fig. 1 

(a) Simulated 100 points with added noise on the surface of a unit sphere, (b) Reconstructed 100 points. 
(c) Histogram of distances of 100 simulated points to the sphere, (d) Histogram of distances of 

RECONSTRUCTED POINTS TO THE SPHERE. (E) HISTOGRAM OF RECONSTRUCTION ERRORS. 



mations are smoothed by a MRF type prior. We demonstrated with some experiments that our 
new model can discover nonlinear structure underlying the datasets. 

As in traditional PCA, dimension selection is a difficult problem in our problem. We are 
currently investigating the possibility of automatic dimension selection as done in [1 1] by putting 
a prior on the dimension. This seems to be a viable approach. 

Although the computational complexity for our model is square in the number of samples, 
which compares favorably with the approach adopted in [9]. It is still desirable to reduce the 
computation if possible. The MRF prior used in our current implementation corresponds to a 
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(a) (b) 
Fig. 2 

Visualization results for the handwritten digits data, (a) Projection given by PCA. (b) Projection given 

BY OUR MODEL. ' 1 ' IS REPRESENTED BY CIRCLES, '2' BY TRIANGLES, AND '3' BY PLUSES. 



complete graph. It is possible to use a sparser graph that only connects nearby points in the 
latent space. This strategy can further reduce the computational complexity. 
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